What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions? You should use this section to motivate the statistical analyses that you decided to use in the next section.
Our first task was to view many drawings at once, since looking at them individually would have been very time consuming. For each type of drawing, we overlaid 500 drawings on top of one another to get a sense of how much the shapes vary. Here are the plots for apples, mushrooms, and bread:
Even though we have plotted a large number of images on top of one another, we can still see the underlying common object that inspired each class of drawings.
We then asked: what if we didn’t connect the lines for each drawing? Since each drawing consists of a number of points connected in a path, we could ignore how the points are connected and focus only on where they are located.
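Dropping the connections can be sketched in a few lines of Python. This is only an illustration: it assumes each drawing is stored as a list of strokes, with each stroke given as a pair of coordinate lists `[xs, ys]`. That layout, the function names, and the 16-pixel cell size are all assumptions made for the example, not details taken from our actual code.

```python
from collections import Counter

def drawing_to_points(drawing):
    """Flatten a drawing (a list of [xs, ys] strokes) into a list of (x, y) points.

    Stroke boundaries and point order within the path are deliberately ignored.
    """
    points = []
    for xs, ys in drawing:
        points.extend(zip(xs, ys))
    return points

def point_counts(drawings, cell=16):
    """Count how many points from all drawings fall into each cell-by-cell grid square."""
    counts = Counter()
    for drawing in drawings:
        for x, y in drawing_to_points(drawing):
            counts[(x // cell, y // cell)] += 1
    return counts

# Two toy "drawings" (hypothetical data): the first has one stroke, the second has two.
toy = [
    [[[0, 10, 20], [0, 10, 0]]],
    [[[5, 15], [5, 5]], [[200, 210], [200, 220]]],
]
counts = point_counts(toy)
```

The resulting counts are the empirical distribution of point locations for a set of drawings, which is the quantity the point-only plots display.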
The images certainly become more difficult to recognize, but a barely recognizable shape remains. This motivates our first approach: using the distribution of points within each class to empirically construct a kernel that could help us classify the images.
Our next step was to create a kernel for each image class by smoothing over the empirical distribution of its points. We used kernel smoothing, with a package called “ks”, to go from the sample densities to a smoothed kernel.
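To make the smoothing step concrete, here is a minimal Python sketch of a two-dimensional Gaussian kernel density estimate, followed by one possible way to use per-class densities for classification. It is not the “ks” implementation: the fixed bandwidth `h`, the function names, the toy data, and the scoring rule (summing log densities over a drawing's points, which treats the points as independent) are all assumptions made for illustration.

```python
import math

def gaussian_kde_2d(points, h):
    """Return a density function built from sample points.

    Uses an isotropic Gaussian kernel with a fixed bandwidth h (an assumption;
    a smoothing package would normally choose the bandwidth from the data).
    """
    n = len(points)
    norm = 1.0 / (n * 2.0 * math.pi * h * h)

    def density(x, y):
        total = 0.0
        for px, py in points:
            d2 = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-d2 / (2.0 * h * h))
        return norm * total

    return density

def log_score(drawing_points, density, eps=1e-12):
    """Sum of log densities over a drawing's points; eps guards against log(0)."""
    return sum(math.log(density(x, y) + eps) for x, y in drawing_points)

def classify(drawing_points, class_densities):
    """Pick the class whose smoothed density gives the drawing the highest score."""
    return max(class_densities,
               key=lambda c: log_score(drawing_points, class_densities[c]))

# Hypothetical pooled points for two classes, "a" and "b".
class_points = {
    "a": [(0, 0), (4, 2), (-3, 1), (1, -2)],
    "b": [(100, 100), (96, 103), (102, 98), (99, 101)],
}
class_densities = {c: gaussian_kde_2d(p, h=5.0) for c, p in class_points.items()}

new_drawing = [(2, 1), (-1, 0), (3, 3)]
predicted = classify(new_drawing, class_densities)
```

Evaluating the density at every sample point is slow for large classes, so this sketch is only practical on small inputs. It is meant to show the idea, not to replace a dedicated smoothing package.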
We can still recognize some of the image classes from the kernels; for others, however, we can’t. We went back to the data and realized there might be a number of ways that each class can be drawn. Let’s look at watermelons, bananas, and peanuts.
It is very clear that people tended to draw two types of peanuts: a sideways peanut or a vertical peanut. Additionally, some people drew a half slice of watermelon while others drew a whole watermelon. If we look very carefully, we can see that people drew bananas in many different ways. We can think of these variations as sub-classes. If we can somehow separate each class of drawings into less variable sub-classes and estimate a kernel for each sub-class separately, our prediction algorithm might improve.
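One possible way to separate a class into sub-classes is to cluster its drawings by where their points fall. The Python sketch below summarizes each drawing as a coarse grid histogram and groups the histograms with k-means. Everything in it is an assumption made for illustration: the choice of k-means, the 4-by-4 grid, the 256-pixel canvas, the farthest-point initialization, the number of sub-classes, and the toy "peanut" data.

```python
def grid_features(points, size=256, bins=4):
    """Summarize a drawing as the fraction of its points in each grid cell.

    Assumes coordinates lie in [0, size); size and bins are illustrative choices.
    """
    cell = size / bins
    feats = [0.0] * (bins * bins)
    for x, y in points:
        col = min(int(x // cell), bins - 1)
        row = min(int(y // cell), bins - 1)
        feats[row * bins + col] += 1.0 / len(points)
    return feats

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors]
        # Recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical "peanut" drawings as point lists: two vertical, two sideways.
drawings = [
    [(128, 20), (128, 100), (128, 180), (128, 240)],   # vertical
    [(20, 128), (100, 128), (180, 128), (240, 128)],   # sideways
    [(135, 10), (130, 40), (140, 170), (128, 230)],    # vertical
    [(10, 135), (40, 130), (170, 140), (230, 128)],    # sideways
]
labels, _ = kmeans([grid_features(d) for d in drawings], k=2)

# Pool the points of each sub-class so a separate kernel can be estimated for each.
subclass_points = {}
for pts, lab in zip(drawings, labels):
    subclass_points.setdefault(lab, []).extend(pts)
```

The pooled points in `subclass_points` could then be smoothed separately, giving one kernel per sub-class instead of one per class.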